Language Detection from Speech: Chinese or English?
Sun 15 Oct 2017 by Tianlong Song Tags: Machine Learning, Natural Language Processing

In language processing, detecting which language is being spoken is an essential step before speech recognition and machine translation. This blog post presents an approach to distinguishing Chinese from English in speech (an audio sample) using a neural network model. Spark is used to perform the data preprocessing, and TensorFlow is used for neural network model training and evaluation.
Raw Data Collection
YouTube videos (with audio extracted) are downloaded and converted to wav format. The data are collected from two representative interview shows, one in each language (Chinese and English):
- 635 minutes of Chinese interviews from Luyu Official (i.e., 《鲁豫有约》)
- 534 minutes of English interviews from Ellen Show
Data Preprocessing
The data preprocessing converts a wav audio file into spectrogram images by the following steps (a sketch follows the list):
- Split the audio into one-second pieces;
- Re-sample (down-sample) each piece to a common sampling rate (16 kHz);
- Apply a Mel Frequency Cepstral Coefficient (MFCC) filter to obtain the spectrogram of each piece;
- Convert the spectrogram into a gray-scale image;
- Improve the contrast of the images by applying histogram equalization and a levels filter;
- Cut or pad the images so that they all have the same size.
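Here is a minimal sketch of these steps for a single wav file, assuming librosa for audio handling and Pillow for image processing. The 16 kHz rate and the one-second window come from the steps above, while `n_mfcc`, the image width and the zero-padding choice are illustrative assumptions, and the levels filter is omitted:

```python
import numpy as np
import librosa
from PIL import Image, ImageOps

def wav_to_spectrogram_images(wav_path, sr=16000, n_mfcc=32, img_width=32):
    """Convert one wav file into per-second gray-scale spectrogram images."""
    # Steps 1-2: load, re-sample to 16 kHz, and split into one-second pieces.
    signal, _ = librosa.load(wav_path, sr=sr)
    pieces = [signal[i:i + sr] for i in range(0, len(signal) - sr + 1, sr)]

    images = []
    for piece in pieces:
        # Step 3: MFCC filter -> 2-D coefficient matrix (n_mfcc x frames).
        mfcc = librosa.feature.mfcc(y=piece, sr=sr, n_mfcc=n_mfcc)

        # Step 4: scale to [0, 255] so it can be viewed as a gray-scale image.
        rng = mfcc.max() - mfcc.min()
        scaled = (255 * (mfcc - mfcc.min()) / (rng + 1e-8)).astype(np.uint8)

        # Step 6 (done on the raw array): cut or zero-pad the time axis.
        scaled = scaled[:, :img_width]
        if scaled.shape[1] < img_width:
            pad = img_width - scaled.shape[1]
            scaled = np.pad(scaled, ((0, 0), (0, pad)), mode='constant')

        # Steps 4-5: build the image and equalize its histogram
        # (the levels filter is omitted in this sketch).
        images.append(ImageOps.equalize(Image.fromarray(scaled, mode='L')))
    return images
```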
To enable parallelization, Spark is used to execute the steps described above. Most steps are self-explanatory, while Step 3 needs a little more explanation. The MFCC filter essentially mimics the functionality of the human cochlea: each audio sample is framed, its power spectrum is calculated, and the result is summed over mel-spaced filter banks. See here for more details, especially on how the mel-spaced filter banks are generated.
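Since each file can be preprocessed independently, the Spark side reduces to a simple map over the file paths. A minimal sketch of the driver, assuming the `wav_to_spectrogram_images` helper above lives in a module (here hypothetically named `preprocess`) that is importable on the workers, with illustrative file paths:

```python
from pyspark import SparkContext
from preprocess import wav_to_spectrogram_images  # helper sketched above

sc = SparkContext(appName='audio-preprocessing')

# Paths are illustrative; in practice these would enumerate the full corpus.
wav_paths = ['data/chinese/ep001.wav', 'data/english/ep001.wav']

# Each file is preprocessed independently, so a simple map suffices.
results = (sc.parallelize(wav_paths)
             .map(lambda p: (p, wav_to_spectrogram_images(p)))
             .collect())
```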
At last, all spectrogram images are labelled, mixed, shuffled and then split into train and test sets by 80%/20% (a sketch of the split follows the list below). After that, we have:
- Train set: 30497 spectrogram images for Chinese, and 25663 spectrogram images for English
- Test set: 7625 spectrogram images for Chinese, and 6416 spectrogram images for English
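The labelling, shuffling and 80/20 split amount to a few lines of NumPy. In the sketch below, `chinese_images` and `english_images` are assumed names for the arrays produced by preprocessing, and the 0/1 label encoding is an arbitrary choice:

```python
import numpy as np

# chinese_images / english_images: lists of 2-D uint8 arrays from preprocessing.
images = np.array(chinese_images + english_images)
labels = np.array([0] * len(chinese_images) + [1] * len(english_images))

# Shuffle images and labels together with the same permutation.
perm = np.random.default_rng(seed=42).permutation(len(images))
images, labels = images[perm], labels[perm]

# 80%/20% train/test split.
split = int(0.8 * len(images))
train_x, train_y = images[:split], labels[:split]
test_x, test_y = images[split:], labels[split:]
```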
Model Training and Evaluation
A Berlinnet neural network model is adopted from here to perform the classification. The model contains the following layers, in the order they appear in the network (a Keras sketch follows the list):
- One input layer
- One convolutional layer
- One local response normalization layer
- One pooling layer
- One convolutional layer
- One local response normalization layer
- One pooling layer
- One convolutional layer
- One local response normalization layer
- One pooling layer
- One fully connected layer
- One fully connected layer
- One output layer
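Below is a sketch of that layer sequence in TensorFlow's Keras API. The filter counts, kernel sizes, dense-layer widths and input shape are illustrative assumptions (the real values live in the linked configuration), and local response normalization is wrapped in a `Lambda` layer since Keras provides no built-in one:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def lrn(x):
    # Local response normalization between each conv and pooling layer.
    return tf.nn.local_response_normalization(x)

model = models.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),       # input: gray-scale spectrogram
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.Lambda(lrn),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.Lambda(lrn),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, padding='same', activation='relu'),
    layers.Lambda(lrn),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),    # fully connected
    layers.Dense(128, activation='relu'),    # fully connected
    layers.Dense(2, activation='softmax'),   # output: Chinese vs. English
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```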
The model is trained and evaluated using the TensorFlow framework. The training configuration, as well as the model itself, can be found here; the results are discussed in the next section.
Results & Discussion
Due to limited resources on a regular PC (which is what I have), only 19300 iterations were performed during training, which took around 24 hours. Nevertheless, the evaluation on the test set delivered an accuracy as high as 92.7%. It should be noted that each classification is based on a very short audio sample (lasting only one second).
There are at least three potential ways to make the accuracy even higher:
- Collect more data to train the model;
- Apply more iterations during training, if more resources are available;
- Most likely, an utterance lasts longer than one second, which gives us a chance to apply majority voting across the classification results drawn independently from multiple one-second audio pieces. Weighted voting is even better, and readily doable, since each classification returns not only a label but also a confidence.
The last approach turns out to be a very effective way to boost the classification accuracy: it barely consumes any additional resources, yet it can reduce the classification error dramatically.
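Both voting schemes reduce to a few lines over the per-second softmax outputs. In the sketch below, `probs` (an assumed name) holds one row of class probabilities per one-second piece of a single utterance, with illustrative values:

```python
import numpy as np

# Per-second softmax outputs for one utterance: shape (n_pieces, n_classes).
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.3, 0.7]])

# Majority voting: each piece casts one hard vote for its most likely class.
votes = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
majority_label = votes.argmax()

# Weighted voting: sum the confidences instead of counting hard votes.
weighted_label = probs.sum(axis=0).argmax()

print(majority_label, weighted_label)  # both 0 for the rows above
```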
Code Repository
Check out the code if you are interested in the implementation.
Acknowledgment
This project is inspired by, and a large portion of its code comes from, the great work here.